<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="Author" content="Doug Cutting">
   <meta content="Grant Ingersoll"  name="Author">
</head>
<body>
Code to search indices.

<h2>Table Of Contents</h2>
<p>
    <ol>
        <li><a href = "#search">Search Basics</a></li>
        <li><a href = "#query">The Query Classes</a></li>
        <li><a href = "#scoring">Changing the Scoring</a></li>
    </ol>
</p>
<a name = "search"></a>
<h2>Search</h2>
<p>
Search over indices.

Applications usually call {@link
org.apache.lucene.search.Searcher#search(Query)} or {@link
org.apache.lucene.search.Searcher#search(Query,Filter)}.

    <!-- FILL IN MORE HERE -->   
</p>
<a name = "query"></a>
<h2>Query Classes</h2>
<h4>
    <a href = "TermQuery.html">TermQuery</a>
</h4>

<p>Of the various implementations of
    <a href = "Query.html">Query</a>, the
    <a href = "TermQuery.html">TermQuery</a>
    is the easiest to understand and the most often used in applications. A <a
        href="TermQuery.html">TermQuery</a> matches all the documents that contain the
    specified
    <a href = "index/Term.html">Term</a>,
    which is a word that occurs in a certain
    <a href = "document/Field.html">Field</a>.
    Thus, a <a href = "TermQuery.html">TermQuery</a> identifies and scores all
    <a href = "document/Document.html">Document</a>s that have a <a
        href="../document/Field.html">Field</a> with the specified string in it.
    Constructing a <a
        href="TermQuery.html">TermQuery</a>
    is as simple as:
    <pre>
        TermQuery tq = new TermQuery(new Term("fieldName", "term"));
    </pre>In this example, the <a href = "Query.html">Query</a> identifies all <a
        href="../document/Document.html">Document</a>s that have the <a
        href="../document/Field.html">Field</a> named <tt>"fieldName"</tt>
    containing the word <tt>"term"</tt>.
</p>
<h4>
    <a href = "BooleanQuery.html">BooleanQuery</a>
</h4>

<p>Things start to get interesting when one combines multiple
    <a href = "TermQuery.html">TermQuery</a> instances into a <a
        href="BooleanQuery.html">BooleanQuery</a>.
    A <a href = "BooleanQuery.html">BooleanQuery</a> contains multiple
    <a href = "BooleanClause.html">BooleanClause</a>s,
    where each clause contains a sub-query (<a href = "Query.html">Query</a>
    instance) and an operator (from <a
        href="BooleanClause.Occur.html">BooleanClause.Occur</a>)
    describing how that sub-query is combined with the other clauses:
    <ol>

        <li><p>SHOULD &mdash; Use this operator when a clause can occur in the result set, but is not required.
            If a query is made up of all SHOULD clauses, then every document in the result
            set matches at least one of these clauses.</p></li>

        <li><p>MUST &mdash; Use this operator when a clause is required to occur in the result set. Every
            document in the result set will match
            all such clauses.</p></li>

        <li><p>MUST NOT &mdash; Use this operator when a
            clause must not occur in the result set. No
            document in the result set will match
            any such clauses.</p></li>
    </ol>
    Boolean queries are constructed by adding two or more
    <a href = "BooleanClause.html">BooleanClause</a>
    instances. If too many clauses are added, a <a href = "BooleanQuery.TooManyClauses.html">TooManyClauses</a>
    exception will be thrown during searching. This most often occurs
    when a <a href = "Query.html">Query</a>
    is rewritten into a <a href = "BooleanQuery.html">BooleanQuery</a> with many
    <a href = "TermQuery.html">TermQuery</a> clauses,
    for example by <a href = "WildcardQuery.html">WildcardQuery</a>.
    The default setting for the maximum number
    of clauses 1024, but this can be changed via the
    static method <a href = "BooleanQuery.html#setMaxClauseCount(int)">setMaxClauseCount</a>
    in <a href = "BooleanQuery.html">BooleanQuery</a>.
</p>

<h4>Phrases</h4>

<p>Another common search is to find documents containing certain phrases. This
    is handled two different ways:
    <ol>
        <li>
            <p><a href = "PhraseQuery.html">PhraseQuery</a>
                &mdash; Matches a sequence of
                <a href = "index/Term.html">Terms</a>.
                <a href = "PhraseQuery.html">PhraseQuery</a> uses a slop factor to determine
                how many positions may occur between any two terms in the phrase and still be considered a match.</p>
        </li>
        <li>
            <p><a href = "spans/SpanNearQuery.html">SpanNearQuery</a>
                &mdash; Matches a sequence of other
                <a href = "spans/SpanQuery.html">SpanQuery</a>
                instances. <a href = "spans/SpanNearQuery.html">SpanNearQuery</a> allows for
                much more
                complicated phrase queries since it is constructed from other <a
                    href="spans/SpanQuery.html">SpanQuery</a>
                instances, instead of only <a href = "TermQuery.html">TermQuery</a>
                instances.</p>
        </li>
    </ol>
</p>
<h4>
    <a href = "RangeQuery.html">RangeQuery</a>
</h4>

<p>The
    <a href = "RangeQuery.html">RangeQuery</a>
    matches all documents that occur in the
    exclusive range of a lower
    <a href = "index/Term.html">Term</a>
    and an upper
    <a href = "index/Term.html">Term</a>.
    For example, one could find all documents
    that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>. This type of <a
        href="Query.html">Query</a> is frequently used to
    find
    documents that occur in a specific date range.
</p>
<h4>
    <a href = "PrefixQuery.html">PrefixQuery</a>,
    <a href = "WildcardQuery.html">WildcardQuery</a>
</h4>

<p>While the
    <a href = "PrefixQuery.html">PrefixQuery</a>
    has a different implementation, it is essentially a special case of the
    <a href = "WildcardQuery.html">WildcardQuery</a>.
    The <a href = "PrefixQuery.html">PrefixQuery</a> allows an application
    to identify all documents with terms that begin with a certain string. The <a
        href="WildcardQuery.html">WildcardQuery</a> generalizes this by allowing
    for the use of <tt>*</tt> (matches 0 or more characters) and <tt>?</tt> (matches exactly one character) wildcards.
    Note that the <a href = "WildcardQuery.html">WildcardQuery</a> can be quite slow. Also
    note that
    <a href = "WildcardQuery.html">WildcardQuery</a> should
    not start with <tt>*</tt> and <tt>?</tt>, as these are extremely slow. 
	To remove this protection and allow a wildcard at the beginning of a term, see method
	<a href = "queryParser/QueryParser.html#setAllowLeadingWildcard(boolean)">setAllowLeadingWildcard</a> in 
	<a href = "queryParser/QueryParser.html">QueryParser</a>.
</p>
<h4>
    <a href = "FuzzyQuery.html">FuzzyQuery</a>
</h4>

<p>A
    <a href = "FuzzyQuery.html">FuzzyQuery</a>
    matches documents that contain terms similar to the specified term. Similarity is
    determined using
    <a href = "http://en.wikipedia.org//wiki/Levenshtein">Levenshtein (edit) distance</a>.
    This type of query can be useful when accounting for spelling variations in the collection.
</p>
<a name = "changingSimilarity"></a>
<h2>Changing Similarity</h2>

<p>Chances are <a href = "DefaultSimilarity.html">DefaultSimilarity</a> is sufficient for all
    your searching needs.
    However, in some applications it may be necessary to customize your <a
        href="Similarity.html">Similarity</a> implementation. For instance, some
    applications do not need to
    distinguish between shorter and longer documents (see <a
        href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).</p>

<p>To change <a href = "Similarity.html">Similarity</a>, one must do so for both indexing and
    searching, and the changes must happen before
    either of these actions take place. Although in theory there is nothing stopping you from changing mid-stream, it
    just isn't well-defined what is going to happen.
</p>

<p>To make this change, implement your own <a href = "Similarity.html">Similarity</a> (likely
    you'll want to simply subclass
    <a href = "DefaultSimilarity.html">DefaultSimilarity</a>) and then use the new
    class by calling
    <a href = "index/IndexWriter.html#setSimilarity(org.apache.lucene.search.Similarity)">IndexWriter.setSimilarity</a>
    before indexing and
    <a href = "Searcher.html#setSimilarity(org.apache.lucene.search.Similarity)">Searcher.setSimilarity</a>
    before searching.
</p>

<p>
    If you are interested in use cases for changing your similarity, see the Lucene users's mailing list at <a
        href="http://www.nabble.com/Overriding-Similarity-tf2128934.html">Overriding Similarity</a>.
    In summary, here are a few use cases:
    <ol>
        <li><p><a href = "api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> &mdash; <a
                href="api/org/apache/lucene/misc/SweetSpotSimilarity.html">SweetSpotSimilarity</a> gives small increases
            as the frequency increases a small amount
            and then greater increases when you hit the "sweet spot", i.e. where you think the frequency of terms is
            more significant.</p></li>
        <li><p>Overriding tf &mdash; In some applications, it doesn't matter what the score of a document is as long as a
            matching term occurs. In these
            cases people have overridden Similarity to return 1 from the tf() method.</p></li>
        <li><p>Changing Length Normalization &mdash; By overriding <a
                href="Similarity.html#lengthNorm(java.lang.String,%20int)">lengthNorm</a>,
            it is possible to discount how the length of a field contributes
            to a score. In <a href = "DefaultSimilarity.html">DefaultSimilarity</a>,
            lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
            1 / (numTerms in field), all fields will be treated
            <a href = "http://www.gossamer-threads.com//lists/lucene/java-user/38967#38967">"fairly"</a>.</p></li>
    </ol>
    In general, Chris Hostetter sums it up best in saying (from <a
        href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users's mailing list</a>):
    <blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just
        that
        it's "text" is a situation where it *might* make sense to to override your
        Similarity method.</blockquote>
</p>
<a name = "scoring"></a>
<h2>Changing Scoring &mdash; Expert Level</h2>

<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
    you want help.
</p>

<p>With the warning out of the way, it is possible to change a lot more than just the Similarity
    when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
    <span >three main classes</span>:
    <ol>
        <li>
            <a href = "Query.html">Query</a> &mdash; The abstract object representation of the
            user's information need.</li>
        <li>
            <a href = "Weight.html">Weight</a> &mdash; The internal interface representation of
            the user's Query, so that Query objects may be reused.</li>
        <li>
            <a href = "Scorer.html">Scorer</a> &mdash; An abstract class containing common
            functionality for scoring. Provides both scoring and explanation capabilities.</li>
    </ol>
    Details on each of these classes, and their children, can be found in the subsections below.
</p>
<h4>The Query Class</h4>
    <p>In some sense, the
        <a href = "Query.html">Query</a>
        class is where it all begins. Without a Query, there would be
        nothing to score. Furthermore, the Query class is the catalyst for the other scoring classes as it
        is often responsible
        for creating them or coordinating the functionality between them. The
        <a href = "Query.html">Query</a> class has several methods that are important for
        derived classes:
        <ol>
            <li>createWeight(Searcher searcher) &mdash; A
                <a href = "Weight.html">Weight</a> is the internal representation of the
                Query, so each Query implementation must
                provide an implementation of Weight. See the subsection on <a
                    href="#The Weight Interface">The Weight Interface</a> below for details on implementing the Weight
                interface.</li>
            <li>rewrite(IndexReader reader) &mdash; Rewrites queries into primitive queries. Primitive queries are:
                <a href = "TermQuery.html">TermQuery</a>,
                <a href = "BooleanQuery.html">BooleanQuery</a>, <span
                    >OTHERS????</span></li>
        </ol>
    </p>
<h4>The Weight Interface</h4>
    <p>The
        <a href = "Weight.html">Weight</a>
        interface provides an internal representation of the Query so that it can be reused. Any
        <a href = "Searcher.html">Searcher</a>
        dependent state should be stored in the Weight implementation,
        not in the Query class. The interface defines six methods that must be implemented:
        <ol>
            <li>
                <a href = "Weight.html#getQuery()">Weight#getQuery()</a> &mdash; Pointer to the
                Query that this Weight represents.</li>
            <li>
                <a href = "Weight.html#getValue()">Weight#getValue()</a> &mdash; The weight for
                this Query. For example, the TermQuery.TermWeight value is
                equal to the idf^2 * boost * queryNorm <!-- DOUBLE CHECK THIS --></li>
            <li>
                <a href = "Weight.html#sumOfSquaredWeights()">
                    Weight#sumOfSquaredWeights()</a> &mdash; The sum of squared weights. For TermQuery, this is (idf *
                boost)^2</li>
            <li>
                <a href = "Weight.html#normalize(float)">
                    Weight#normalize(float)</a> &mdash; Determine the query normalization factor. The query normalization may
                allow for comparing scores between queries.</li>
            <li>
                <a href = "Weight.html#scorer(IndexReader)">
                    Weight#scorer(IndexReader)</a> &mdash; Construct a new
                <a href = "Scorer.html">Scorer</a>
                for this Weight. See
                <a href = "#The Scorer Class">The Scorer Class</a>
                below for help defining a Scorer. As the name implies, the
                Scorer is responsible for doing the actual scoring of documents given the Query.
            </li>
            <li>
                <a href = "Weight.html#explain(IndexReader, int)">
                    Weight#explain(IndexReader, int)</a> &mdash; Provide a means for explaining why a given document was
                scored
                the way it was.</li>
        </ol>
    </p>
<h4>The Scorer Class</h4>
    <p>The
        <a href = "Scorer.html">Scorer</a>
        abstract class provides common scoring functionality for all Scorer implementations and
        is the heart of the Lucene scoring process. The Scorer defines the following abstract methods which
        must be implemented:
        <ol>
            <li>
                <a href = "Scorer.html#next()">Scorer#next()</a> &mdash; Advances to the next
                document that matches this Query, returning true if and only
                if there is another document that matches.</li>
            <li>
                <a href = "Scorer.html#doc()">Scorer#doc()</a> &mdash; Returns the id of the
                <a href = "document/Document.html">Document</a>
                that contains the match. It is not valid until next() has been called at least once.
            </li>
            <li>
                <a href = "Scorer.html#score()">Scorer#score()</a> &mdash; Return the score of the
                current document. This value can be determined in any
                appropriate way for an application. For instance, the
                <a href = "http://svn.apache.org//viewvc/lucene/java/trunk/src/java/org/apache/lucene/search/TermScorer.java?view=log">TermScorer</a>
                returns the tf * Weight.getValue() * fieldNorm.
            </li>
            <li>
                <a href = "Scorer.html#skipTo(int)">Scorer#skipTo(int)</a> &mdash; Skip ahead in
                the document matches to the document whose id is greater than
                or equal to the passed in value. In many instances, skipTo can be
                implemented more efficiently than simply looping through all the matching documents until
                the target document is identified.</li>
            <li>
                <a href = "Scorer.html#explain(int)">Scorer#explain(int)</a> &mdash; Provides
                details on why the score came about.</li>
        </ol>
    </p>
<h4>Why would I want to add my own Query?</h4>

    <p>In a nutshell, you want to add your own custom Query implementation when you think that Lucene's
        aren't appropriate for the
        task that you want to do. You might be doing some cutting edge research or you need more information
        back
        out of Lucene (similar to Doug adding SpanQuery functionality).</p>
<h4>Examples</h4>
    <p >FILL IN HERE</p>

</body>
</html>
